Psychometric Measurement of Forecasters Using the Wisdom of Crowds

FRI Research Discussion 2026-01-16

Jessica Helmer, Sophie Ma Zhu, Nikolay Petrov, Ezra Karger, Mark Himmelstein

Efficient Skill Measurement

Intersubjective scoring

○ Reduces measurement error due to variability in the outcome
○ Fewer items needed

Cognitive tasks

○ Identifying top-performing items within the Forecasting Proficiency Test’s cognitive tasks
○ Shorter in time, maintain (or increase?) predictive ability

Intersubjective Measures

Measuring Forecasting Skill

Typically assess forecasting skill by examining the squared distance between a forecaster’s forecast and some scoring criterion over many items

\[ X_n = \frac{1}{K} \sum_{k=1}^K (F_{nk} - C_k)^2 \]

\(k\) indexing \(K\) items,
\(F_{nk}\) representing forecaster \(n\)’s forecast on item \(k\), and
\(C_k\) representing a scoring criterion for item \(k\)
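In code, this mean-squared-distance score is straightforward; a minimal sketch (the function name `skill_score` and the toy numbers are ours):

```python
import numpy as np

def skill_score(forecasts, criteria):
    """Mean squared distance between one forecaster's forecasts F_nk
    and the scoring criteria C_k, averaged over the K items."""
    forecasts = np.asarray(forecasts, dtype=float)
    criteria = np.asarray(criteria, dtype=float)
    return float(np.mean((forecasts - criteria) ** 2))

# Toy example: K = 4 items with criteria C_k and one forecaster's F_nk.
C = [1.0, 2.0, 0.5, -1.0]
F = [1.1, 1.8, 0.5, -0.7]
print(skill_score(F, C))  # lower is better
```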

Measuring Forecasting Skill

\[ X_n = \frac{1}{K} \sum_{k=1}^K (F_{nk} - C_k)^2 \]

In typical testing, \(C_k\) would be fixed and known

○ What’s the capital of ????

In forecasting, item outcome not known at time of testing

○ What will the population of ??? be in 2050?

“Correct answer” more like a random variable

Measuring Forecasting Skill

“Correct answer” more like a random variable

E.g., for a hypothetical item \(k\) whose outcome is normally distributed with an unknown mean and standard deviation, we can represent this as \[ O_k \sim \mathcal{N}(\mu_k, \sigma_k) \]

An optimal point forecast for this item \(F_k\) would be \(\mu_k\), so we would ideally score \(F_k\) by comparing it to \(\mu_k\)

\[ X_n^{\mu_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - \mu_k)^2 \]

Measuring Forecasting Skill

But, since \(\mu_k\) is rarely known, we typically use the outcome \(O_k\) as the scoring criterion instead.

\[X_n^{O_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - O_k)^2 \text{ where } O_k \sim \mathcal{N}(\mu_k, \sigma_k)\]

Measurement error is clear if we rewrite this as

\[ X_n^{O_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - (\mu_k + \epsilon_{O_k}))^2 \]

Ideally, we would want to minimize \(\epsilon_{O_k}\).
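This inflation is easy to verify numerically. A minimal sketch (all parameter values are illustrative): scoring one fixed forecast against repeated draws of the outcome recovers \((F - \mu_k)^2 + \sigma_k^2\), i.e., outcome noise \(\sigma_k^2\) is added to every ground-truth score.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single item: true mean mu_k, outcome noise sigma_k.
mu_k, sigma_k = 0.0, 2.0
F = 0.5  # one fixed forecast

# Score the same forecast against many draws of the outcome O_k.
O = rng.normal(mu_k, sigma_k, size=200_000)
empirical = np.mean((F - O) ** 2)

# Decomposition: E[(F - O_k)^2] = (F - mu_k)^2 + sigma_k^2,
# so outcome noise sigma_k^2 inflates every ground-truth score.
print(empirical, (F - mu_k) ** 2 + sigma_k ** 2)
```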

Intersubjective Measures

Using peer comparisons to score forecasts instead of using outcomes

Wisdom of crowds: when many predictions are aggregated, individual errors tend to cancel out.

○ Aggregate predictions tend to be more accurate than the average prediction within a crowd

Crowd Aggregates as Scoring Criterion

The logic of wisdom of crowds assumes that the distribution of forecasts for an item has the same mean as the outcome distribution \(\mu_{F_k} = \mu_{k}\).

\[ F_k \sim \mathcal{N} (\mu_k, \sigma_{F_k}) \]

Under the Central Limit Theorem, the average of many forecasts converges on the mean of the outcome distribution, with standard error inversely proportional to the square root of the number of forecasters \(N\)

\[ A_k \sim \mathcal{N} (\mu_k, \frac{\sigma_{F_k}}{\sqrt{N}}) \]

Crowd Aggregates as Scoring Criterion

Substituting the crowd aggregate \(A_k\) for the scoring criterion,

\[X_n^{A_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - A_k)^2 \text{ where } A_k \sim \mathcal{N} (\mu_k, \frac{\sigma_{F_k}}{\sqrt{N}}) \]

\[ X_n^{A_k} = \frac{1}{K} \sum_{k=1}^K (F_{nk} - (\mu_k + \epsilon_{A_k}))^2 \]

As long as a crowd is unbiased (\(\mu_{F_k}=\mu_k\)) and sufficiently large \(\left(\frac{\sigma_{F_k}}{\sqrt N}<\sigma_k\right)\), on average \(\epsilon_{A_k}^2<\epsilon_{O_k}^2\) and the wisdom of that crowd \(A_k\) will be a better representation of the optimal point forecast \(\mu_k\) than \(O_k\).
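A quick numerical check of this claim, with illustrative parameter values of our choosing: the squared error of the crowd aggregate as a criterion shrinks toward \(\sigma_{F_k}^2 / N\), while the squared error of the outcome stays at \(\sigma_k^2\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical item: mu_k = 0, outcome noise sigma_k, forecast spread sigma_F.
mu_k, sigma_k, sigma_F, N = 0.0, 1.0, 1.5, 50

trials = 100_000
# Error of the realized outcome as a criterion: eps_O = O_k - mu_k
eps_O = rng.normal(mu_k, sigma_k, size=trials) - mu_k
# Error of an unbiased crowd aggregate of N forecasts: eps_A = A_k - mu_k
A = rng.normal(mu_k, sigma_F, size=(trials, N)).mean(axis=1)
eps_A = A - mu_k

print(np.mean(eps_O ** 2))  # approx. sigma_k^2
print(np.mean(eps_A ** 2))  # approx. sigma_F^2 / N, much smaller here
```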


Simulation

\(N\) = 1,000 forecasters

\(\theta_n \sim \mathcal{N}(0, 1)\)
○ Varying skill levels

\(K\) = 1,000 items

\(O_k \sim \mathcal{N}(0, \sigma_k)\)
○ Specified item “noisiness”


Generated forecasts for each forecaster on each item

\(\theta_n\) defined the expected distance between an item's expected outcome and a forecaster's forecast

○ Mean of forecaster’s forecast distribution \(\mu_{nk} \sim \mathcal{N}(0, e^{\theta_n})\)
○ Forecaster’s forecast distribution on that item \(f_{nk} \sim \mathcal{N}(\mu_{nk}, \sigma_k)\)
○ Drew [.05, .25, .50, .75, .95] quantiles for each forecast
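The generative steps above can be sketched as follows (point forecasts only; the quantile-drawing step is omitted, and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulation setup described above: N forecasters, K items, item noise sigma_k.
N, K, sigma_k = 1_000, 1_000, 1.0

theta = rng.normal(0.0, 1.0, size=N)   # latent skill per forecaster
O = rng.normal(0.0, sigma_k, size=K)   # realized outcome per item

# Mean of each forecaster's forecast distribution: mu_nk ~ N(0, e^theta_n),
# so low theta_n means forecasts centered close to the expected outcome (0).
mu = rng.normal(0.0, np.exp(theta)[:, None], size=(N, K))
# Point forecast drawn from that distribution with item noise sigma_k.
f = rng.normal(mu, sigma_k)

print(f.shape)  # one forecast per forecaster per item
```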

Simulation

Sampled combinations of:

\(N \in\) [4, 8, 16, 32] forecasters and
\(K \in\) [4, 8, 16, 32] items for
\(\sigma_k \in\) [1, 2]

Scored each forecast with:

○ ground truth
○ intersubjective scoring (squared distance from aggregate forecast of group of size \(N\))

Which scoring measure captures forecasters’ skill better?
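A compact version of this comparison, under the same generative assumptions (our parameter choices; one cell of the simulation grid):

```python
import numpy as np

rng = np.random.default_rng(4)

# One grid cell: a small crowd of N forecasters on K items, noise sigma_k.
N, K, sigma_k = 16, 16, 1.0

theta = rng.normal(0.0, 1.0, size=N)
O = rng.normal(0.0, sigma_k, size=K)
mu = rng.normal(0.0, np.exp(theta)[:, None], size=(N, K))
f = rng.normal(mu, sigma_k)

# Aggregate includes each forecaster's own forecast, for simplicity.
A = f.mean(axis=0)

ground_truth = np.mean((f - O) ** 2, axis=1)      # score vs realized outcome
intersubjective = np.mean((f - A) ** 2, axis=1)   # proxy score vs aggregate

# Which score tracks the latent skill parameter theta better?
print(np.corrcoef(theta, ground_truth)[0, 1])
print(np.corrcoef(theta, intersubjective)[0, 1])
```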

Simulation: \(\sigma = 1\)

Intersubjective scoring correlation increases with \(N\)

For \(N\) ≥ 16, intersubjective scoring captures the original skill parameter better than ground-truth scoring

Simulation: \(\sigma = 2\)

Increased variance affects ground-truth scoring but not intersubjective scoring

Intersubjective scoring has stronger correlations with skill at lower \(N\) and \(K\) combinations than before

Intersubjective Measures

Types of intersubjective measures

Proxy Scoring

○ Scoring by a forecast’s distance from the aggregate forecast

Metapredictions

○ Explicit predictions about crowd aggregates
○ What would the average person predict that the population of ??? will be in 2050?

If a forecaster correctly identified that \(\mu_k \neq \mu_{F_k}\) (i.e., bias in the crowd), proxy scoring would incorrectly penalize the forecaster for reporting their true belief

Metapredictions explicitly ask about beliefs of \(\mu_{F_k}\) but also require more effort from the forecaster

Intersubjective Measures: Existing Work

Originally tested for their real-time scoring ability, but proved surprisingly successful at predicting forecasting accuracy

○ More reliable than ground-truth scoring

But only explored with categorical forecasting items

|  | Ground-Truth Outcome | Intersubjective Outcome |
| --- | --- | --- |
| Categorical Item | Price of gas was between $3.50 and $4.00 | Average probability judgement about the price of gas being between $3.50 and $4.00 was 52% |
| Continuous Item | Price of gas was $3.78 | Average prediction for price of gas was $3.80 |

Measurement scale confounds reliability boost from intersubjective measures

Present Study

  1. Are intersubjective measures still good predictors of forecasting accuracy with continuous items?
  2. Are proxy scores or metapredictions stronger predictors of forecasting accuracy?

Superforecasters

Previously identified skilled forecasters

Aggregating these superforecasters may serve as a better reference criterion than aggregating the general crowd

Proxy scoring with a superforecaster crowd aggregate has been an effective method

○ But has only been implemented between-subjects

Present Study

  1. Are intersubjective measures still good predictors of forecasting accuracy in the more reliable Quantile Elicitation Format (QEF)?

  2. Are proxy scores or metapredictions stronger predictors of forecasting accuracy?

  3. Are superforecaster aggregates a better reference criterion than general crowd aggregates?

Methods

Final wave of a longitudinal forecasting study (N = 894)

Forecasts on six items in the QEF

Metapredictions after each item:

○ What do you think the average person would predict?
○ What do you think a superforecaster would predict?

Additional sample of N = 42 superforecasters

Scoring Methods

Scored each forecast with:

○ Ground truth (own forecast distance from actual outcome)
○ Proxy scoring (own forecast distance from crowd aggregate): Proxy Crowd and Proxy Super
○ Metaprediction accuracy (metaprediction distance from crowd aggregate): Metaprediction Crowd and Metaprediction Super

Compared these scores to forecasting accuracy on a separate set of thirty items

Aggregate Accuracy

Ground-truth score distributions

Superforecaster aggregate more accurate more often

Correlations

Superforecaster metapredictions and proxy scores strongest

Contributions to Forecasting Proficiency

How much variance in forecaster accuracy does each score explain?

Random effects for item and person, conducted dominance analysis

| Score | Contribution | Proportion |
| --- | --- | --- |
| Proxy Super | .119 | .225 |
| Metaprediction Super | .118 | .223 |
| Proxy Crowd | .109 | .206 |
| Metaprediction Crowd | .094 | .178 |
| Ground Truth | .089 | .168 |
| \(R_{forecaster}^2\) | .53 | 1.00 |

Select Crowds

Proxy scores effective way of finding select crowds to aggregate

Discussion

Intersubjective measures still effective measure of forecasting ability

○ More reliable picture of the truth
○ Less influenced by unpredictability in items
○ Superforecaster aggregates particularly useful

Can reduce measurement error by modifying the scoring criterion

○ Rather than modifying the task

Cognitive Tasks

[Mostly placeholder content right now]

Forecasting Proficiency Test

A test to measure forecasting proficiency; we have data on both forecasting accuracy and cognitive tasks

| Version | Cognitive Tasks | Appx. Cognitive Test Time | \(R^2\) of Cognitive Tests Alone |
| --- | --- | --- | --- |
| 1 | None | 0 min | — |
| 2 | DN-C, BU Hard | 12 min | .43 |
| 3 | DN-C, BU Hard, ADMC-DR, NS, CFS | 24 min | .51 |
| 4 | DN-C, BU Hard, ADMC-DR, NS, CFS, MR | 36 min | .53 |
| 5 | DN-C, BU Hard, ADMC-DR, NS, CFS, MR, DN-S, BU Easy | 48 min | .54 |

Room for optimization at the item-level?

Optimization Game

Search algorithm for most informative (about forecasting accuracy) items

  • Iterates over all items within a set of tasks

  • Identifies item that, when added to the “test,” yields the maximum increase in \(R^2\) per testing time

○ Time is currently just the total time of a task divided by its number of items (not accounting for “startup cost” when adding an item from a new task)
○ \(R^2\) is from a single-level linear model with sum scores of each task predicting forecasting accuracy
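A sketch of this greedy forward selection on toy data (the data-generating step, variable names, and the simple sum-score \(R^2\) are our stand-ins, not the actual FPT pipeline):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy data: items load on a latent skill that also drives forecasting accuracy.
n_people, n_items = 300, 12
loadings = rng.uniform(0.2, 0.9, size=n_items)      # item informativeness
skill = rng.normal(size=n_people)
items = loadings * skill[:, None] + rng.normal(size=(n_people, n_items))
accuracy = skill + 0.5 * rng.normal(size=n_people)  # criterion to predict
time_cost = rng.uniform(0.5, 2.0, size=n_items)     # minutes per item

def r2(selected):
    """R^2 from regressing accuracy on the sum score of the selected items."""
    if not selected:
        return 0.0
    x = items[:, selected].sum(axis=1)
    return float(np.corrcoef(x, accuracy)[0, 1] ** 2)

selected, remaining = [], list(range(n_items))
current = 0.0
while remaining:
    # Pick the item with the best R^2 gain per unit of testing time.
    gains = [(r2(selected + [j]) - current) / time_cost[j] for j in remaining]
    best = int(np.argmax(gains))
    if gains[best] <= 0:
        break  # no remaining item improves the test
    selected.append(remaining.pop(best))
    current = r2(selected)

print(selected, round(current, 3))
```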

Results: Search Algorithm

Same pattern with randomly selected items

Discussion

May be able to decrease testing time for cognitive tasks in the FPT

Some items seem to be adding more noise than predictive ability

Want to identify importance of uniquely informative items vs. randomly selected items from generally informative tasks

Project is very early, interested in feedback and suggestions!

References

Atanasov, P., & Himmelstein, M. (2023). Talent spotting in crowd prediction. In M. Seifert (Ed.), Judgment in Predictive Analytics. Springer.

Atanasov, P., Rescober, P., Stone, E., Swift, S. A., Servan-Schreiber, E., Tetlock, P., Ungar, L., & Mellers, B. (2017). Distilling the Wisdom of Crowds: Prediction Markets vs. Prediction Polls. Management Science, 63(3), 691–706. https://doi.org/10.1287/mnsc.2015.2374

Galton, F. (1907). Vox Populi. Nature, 75(1949), 450–451. https://doi.org/10.1038/075450a0

Himmelstein, M., Budescu, D. V., & Ho, E. H. (2023). The wisdom of many in few: Finding individuals who are as wise as the crowd. Journal of Experimental Psychology: General, 152(5), 1223–1244. https://doi.org/10.1037/xge0001340

Himmelstein, M., Zhu, S. M., Petrov, N., Karger, E., Helmer, J., Livnat, S., Bennett, A., Hedley, P., & Tetlock, P. (2025). The Forecasting Proficiency Test: A General Use Assessment of Forecasting Ability. OSF. https://doi.org/10.31234/osf.io/a7kdx

Karger, E., Monrad, J., Mellers, B., & Tetlock, P. (2021). Reciprocal Scoring: A Method for Forecasting Unanswerable Questions. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3954498

Tetlock, P. E., & Gardner, D. (2015). Superforecasting: The Art and Science of Prediction. Crown.

Wilkening, T., Martinie, M., & Howe, P. D. L. (2022). Hidden Experts in the Crowd: Using Meta-Predictions to Leverage Expertise in Single-Question Prediction Problems. Management Science, 68(1), 487–508. https://doi.org/10.1287/mnsc.2020.3919

Zhu, S. M., Budescu, D. V., Petrov, N., Karger, E., & Himmelstein, M. (2024). The psychometric properties of probability and quantile forecasts. Preprint.

Thank you!

Questions?